{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COMPSCI 389: Introduction to Machine Learning\n",
    "# Topic 2.1: Pandas and Data Sets\n",
    "\n",
    "This notebook describes how data sets are represented and manipulated using the `pandas` library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is pandas?\n",
    "\n",
    "The name pandas is derived from \"panel data,\" an econometrics term for data sets that track the same individuals over multiple time periods. Documentation: [pandas.pydata.org/docs](https://pandas.pydata.org/docs/index.html).\n",
    "\n",
    "It provides two main objects: a **DataFrame** and a **Series**.\n",
    "\n",
    "A DataFrame object stores a 2-dimensional table of data, while a Series stores a 1-dimensional vector of data.\n",
    "\n",
    "Pandas provides useful functions for working with these objects including functions for:\n",
    "1. Loading data sets from files and storing them in DataFrame and/or Series objects.\n",
    "2. Manipulating DataFrame and Series objects (e.g., adding or removing features).\n",
    "3. Computing statistics of the data (e.g., the minimum and maximum values of features).\n",
    "\n",
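    "As a rough illustration of the second and third kinds of functions (loading is shown with `read_csv` later in this notebook), here is a minimal sketch on a tiny made-up table. The column names and values are hypothetical:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: two features for three people.\n",
    "toy = pd.DataFrame({'height': [1.5, 1.8, 1.7], 'weight': [50.0, 80.0, 65.0]})\n",
    "\n",
    "# Manipulate: add a feature computed from existing columns, then remove one.\n",
    "toy['bmi'] = toy['weight'] / toy['height'] ** 2\n",
    "toy = toy.drop(columns=['weight'])\n",
    "\n",
    "# Compute statistics: per-column minimum and maximum (each result is a Series).\n",
    "col_min = toy.min()\n",
    "col_max = toy.max()\n",
    "\n",
    "print(type(toy['height']))  # a single column is a Series\n",
    "print(col_min)\n",
    "print(col_max)\n",
    "```\n",
    "\n",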
    "Pandas has become so common that many other ML libraries in Python are built to be compatible with pandas, as we will see below.\n",
    "\n",
    "To install pandas, run the following command in the console or command line:\n",
    "\n",
    "> `pip install pandas`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Data Sets\n",
    "\n",
    "In the remainder of this notebook we load and inspect a few example data sets for supervised learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPA Data\n",
    "\n",
    "The GPA data set contains data about undergraduate students at the *Universidade Federal do Rio Grande do Sul* (UFRGS) in Brazil.\n",
    "\n",
    "**Input**: Scores on 9 entrance exams: \n",
    "1. Physics\n",
    "2. Biology\n",
    "3. History\n",
    "4. English\n",
    "5. Geography\n",
    "6. Literature\n",
    "7. Portuguese\n",
    "8. Math\n",
    "9. Chemistry\n",
    "\n",
    "**Output**: GPA on a 4.0 scale during the first three semesters at university.\n",
    " - The GPA can be used for regression (predict the GPA) or classification (predict the GPA range, e.g., whether it is at least 3.0).\n",
    "\n",
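    "For instance, a binary classification label could be derived from the numerical GPA as sketched below. The GPA values are made up for illustration, and the names `gpa` and `at_least_3` are hypothetical:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical GPA values (not taken from the data set).\n",
    "gpa = pd.Series([1.3, 3.0, 3.8, 2.5], name='gpa')\n",
    "\n",
    "# Regression target: the GPA itself.\n",
    "# Classification target: True when the GPA is at least 3.0.\n",
    "at_least_3 = gpa >= 3.0\n",
    "\n",
    "print(at_least_3.tolist())\n",
    "```\n",
    "\n",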
    "**Data Set Size**: 43,303 rows\n",
    "\n",
    "Let's start by loading and displaying this data set. The data set is available here:\n",
    "\n",
    "[https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv](https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv)\n",
    "\n",
    "You can either download it into a directory called `data` next to this `.ipynb` file and load the local copy, or load it directly from the URL:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>physics</th>\n",
       "      <th>biology</th>\n",
       "      <th>history</th>\n",
       "      <th>English</th>\n",
       "      <th>geography</th>\n",
       "      <th>literature</th>\n",
       "      <th>Portuguese</th>\n",
       "      <th>math</th>\n",
       "      <th>chemistry</th>\n",
       "      <th>gpa</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>622.60</td>\n",
       "      <td>491.56</td>\n",
       "      <td>439.93</td>\n",
       "      <td>707.64</td>\n",
       "      <td>663.65</td>\n",
       "      <td>557.09</td>\n",
       "      <td>711.37</td>\n",
       "      <td>731.31</td>\n",
       "      <td>509.80</td>\n",
       "      <td>1.33333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>538.00</td>\n",
       "      <td>490.58</td>\n",
       "      <td>406.59</td>\n",
       "      <td>529.05</td>\n",
       "      <td>532.28</td>\n",
       "      <td>447.23</td>\n",
       "      <td>527.58</td>\n",
       "      <td>379.14</td>\n",
       "      <td>488.64</td>\n",
       "      <td>2.98333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>455.18</td>\n",
       "      <td>440.00</td>\n",
       "      <td>570.86</td>\n",
       "      <td>417.54</td>\n",
       "      <td>453.53</td>\n",
       "      <td>425.87</td>\n",
       "      <td>475.63</td>\n",
       "      <td>476.11</td>\n",
       "      <td>407.15</td>\n",
       "      <td>1.97333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>756.91</td>\n",
       "      <td>679.62</td>\n",
       "      <td>531.28</td>\n",
       "      <td>583.63</td>\n",
       "      <td>534.42</td>\n",
       "      <td>521.40</td>\n",
       "      <td>592.41</td>\n",
       "      <td>783.76</td>\n",
       "      <td>588.26</td>\n",
       "      <td>2.53333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>584.54</td>\n",
       "      <td>649.84</td>\n",
       "      <td>637.43</td>\n",
       "      <td>609.06</td>\n",
       "      <td>670.46</td>\n",
       "      <td>515.38</td>\n",
       "      <td>572.52</td>\n",
       "      <td>581.25</td>\n",
       "      <td>529.04</td>\n",
       "      <td>1.58667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43298</th>\n",
       "      <td>519.55</td>\n",
       "      <td>622.20</td>\n",
       "      <td>660.90</td>\n",
       "      <td>543.48</td>\n",
       "      <td>643.05</td>\n",
       "      <td>579.90</td>\n",
       "      <td>584.80</td>\n",
       "      <td>581.25</td>\n",
       "      <td>573.92</td>\n",
       "      <td>2.76333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43299</th>\n",
       "      <td>816.39</td>\n",
       "      <td>851.95</td>\n",
       "      <td>732.39</td>\n",
       "      <td>621.63</td>\n",
       "      <td>810.68</td>\n",
       "      <td>666.79</td>\n",
       "      <td>705.22</td>\n",
       "      <td>781.01</td>\n",
       "      <td>831.76</td>\n",
       "      <td>3.81667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43300</th>\n",
       "      <td>798.75</td>\n",
       "      <td>817.58</td>\n",
       "      <td>731.98</td>\n",
       "      <td>648.42</td>\n",
       "      <td>751.30</td>\n",
       "      <td>648.67</td>\n",
       "      <td>662.05</td>\n",
       "      <td>773.15</td>\n",
       "      <td>835.25</td>\n",
       "      <td>3.75000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43301</th>\n",
       "      <td>527.66</td>\n",
       "      <td>443.82</td>\n",
       "      <td>545.88</td>\n",
       "      <td>624.18</td>\n",
       "      <td>420.25</td>\n",
       "      <td>676.80</td>\n",
       "      <td>583.41</td>\n",
       "      <td>395.46</td>\n",
       "      <td>509.80</td>\n",
       "      <td>2.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43302</th>\n",
       "      <td>512.56</td>\n",
       "      <td>415.41</td>\n",
       "      <td>517.36</td>\n",
       "      <td>532.37</td>\n",
       "      <td>592.30</td>\n",
       "      <td>382.20</td>\n",
       "      <td>538.35</td>\n",
       "      <td>448.02</td>\n",
       "      <td>496.39</td>\n",
       "      <td>3.16667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>43303 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       physics  biology  history  English  geography  literature  Portuguese  \\\n",
       "0       622.60   491.56   439.93   707.64     663.65      557.09      711.37   \n",
       "1       538.00   490.58   406.59   529.05     532.28      447.23      527.58   \n",
       "2       455.18   440.00   570.86   417.54     453.53      425.87      475.63   \n",
       "3       756.91   679.62   531.28   583.63     534.42      521.40      592.41   \n",
       "4       584.54   649.84   637.43   609.06     670.46      515.38      572.52   \n",
       "...        ...      ...      ...      ...        ...         ...         ...   \n",
       "43298   519.55   622.20   660.90   543.48     643.05      579.90      584.80   \n",
       "43299   816.39   851.95   732.39   621.63     810.68      666.79      705.22   \n",
       "43300   798.75   817.58   731.98   648.42     751.30      648.67      662.05   \n",
       "43301   527.66   443.82   545.88   624.18     420.25      676.80      583.41   \n",
       "43302   512.56   415.41   517.36   532.37     592.30      382.20      538.35   \n",
       "\n",
       "         math  chemistry      gpa  \n",
       "0      731.31     509.80  1.33333  \n",
       "1      379.14     488.64  2.98333  \n",
       "2      476.11     407.15  1.97333  \n",
       "3      783.76     588.26  2.53333  \n",
       "4      581.25     529.04  1.58667  \n",
       "...       ...        ...      ...  \n",
       "43298  581.25     573.92  2.76333  \n",
       "43299  781.01     831.76  3.81667  \n",
       "43300  773.15     835.25  3.75000  \n",
       "43301  395.46     509.80  2.50000  \n",
       "43302  448.02     496.39  3.16667  \n",
       "\n",
       "[43303 rows x 10 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd                             # Import pandas\n",
    "\n",
    "# Load the data set directly from the URL; values are separated by commas (the read_csv default)\n",
    "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',')\n",
    "\n",
    "# Alternatively, load the data set from a local `data` directory\n",
    "# df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n",
    "\n",
    "# print(df)                                     # Prints a string representation of the DataFrame\n",
    "display(df)                                     # Renders an HTML table (for Jupyter Notebooks - don't use in .py file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question**: Is each column numerical, categorical, text, or an image? Continuous, discrete, nominal, or ordinal?\n",
    "\n",
    "**Answer**: All of these columns are numerical and continuous.\n",
    "\n",
    "**Question**: If the GPAs were binned into letter grades A, B, C, ..., F, would they be numerical, categorical, text, or an image? Continuous, discrete, nominal, or ordinal?\n",
    "\n",
    "**Answer**: In this case the GPAs would be categorical, and specifically ordinal.\n",
    "\n",
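    "One way such a binning might be done is with `pd.cut`, as in the sketch below. The GPA values and the cut-offs between letter grades are assumptions made for illustration, not the actual grading scale:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical GPA values and cut-offs; the real grading scale may differ.\n",
    "gpa = pd.Series([1.3, 3.0, 3.8, 2.5, 0.4])\n",
    "edges = [0.0, 1.0, 2.0, 3.0, 3.7, float('inf')]  # assumed bin boundaries\n",
    "letters = ['F', 'D', 'C', 'B', 'A']                # ordered from lowest to highest\n",
    "\n",
    "# right=False makes each bin include its left edge, e.g. [3.0, 3.7) -> 'B'.\n",
    "# pd.cut returns an ordered categorical, so comparisons between grades make sense.\n",
    "grades = pd.cut(gpa, bins=edges, labels=letters, right=False)\n",
    "\n",
    "print(grades.tolist())\n",
    "print(grades.cat.ordered)\n",
    "print((grades >= 'C').tolist())  # which students earned at least a C\n",
    "```\n",
    "\n",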
    "Notice that pandas views this as a table with rows and columns. Hence features *and* labels are viewed as \"columns\" when using pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manipulating DataFrames\n",
    "\n",
    "In this section we give some examples of how DataFrames can be used to compute statistics of data and how DataFrames can be manipulated.\n",
    "\n",
    "First, let's use the `iloc` indexer in pandas (integer-location based indexing, i.e., selection by position) to split the data set into the input features $X$ and the targets/labels $y$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>physics</th>\n",
       "      <th>biology</th>\n",
       "      <th>history</th>\n",
       "      <th>English</th>\n",
       "      <th>geography</th>\n",
       "      <th>literature</th>\n",
       "      <th>Portuguese</th>\n",
       "      <th>math</th>\n",
       "      <th>chemistry</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>622.60</td>\n",
       "      <td>491.56</td>\n",
       "      <td>439.93</td>\n",
       "      <td>707.64</td>\n",
       "      <td>663.65</td>\n",
       "      <td>557.09</td>\n",
       "      <td>711.37</td>\n",
       "      <td>731.31</td>\n",
       "      <td>509.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>538.00</td>\n",
       "      <td>490.58</td>\n",
       "      <td>406.59</td>\n",
       "      <td>529.05</td>\n",
       "      <td>532.28</td>\n",
       "      <td>447.23</td>\n",
       "      <td>527.58</td>\n",
       "      <td>379.14</td>\n",
       "      <td>488.64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>455.18</td>\n",
       "      <td>440.00</td>\n",
       "      <td>570.86</td>\n",
       "      <td>417.54</td>\n",
       "      <td>453.53</td>\n",
       "      <td>425.87</td>\n",
       "      <td>475.63</td>\n",
       "      <td>476.11</td>\n",
       "      <td>407.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>756.91</td>\n",
       "      <td>679.62</td>\n",
       "      <td>531.28</td>\n",
       "      <td>583.63</td>\n",
       "      <td>534.42</td>\n",
       "      <td>521.40</td>\n",
       "      <td>592.41</td>\n",
       "      <td>783.76</td>\n",
       "      <td>588.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>584.54</td>\n",
       "      <td>649.84</td>\n",
       "      <td>637.43</td>\n",
       "      <td>609.06</td>\n",
       "      <td>670.46</td>\n",
       "      <td>515.38</td>\n",
       "      <td>572.52</td>\n",
       "      <td>581.25</td>\n",
       "      <td>529.04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43298</th>\n",
       "      <td>519.55</td>\n",
       "      <td>622.20</td>\n",
       "      <td>660.90</td>\n",
       "      <td>543.48</td>\n",
       "      <td>643.05</td>\n",
       "      <td>579.90</td>\n",
       "      <td>584.80</td>\n",
       "      <td>581.25</td>\n",
       "      <td>573.92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43299</th>\n",
       "      <td>816.39</td>\n",
       "      <td>851.95</td>\n",
       "      <td>732.39</td>\n",
       "      <td>621.63</td>\n",
       "      <td>810.68</td>\n",
       "      <td>666.79</td>\n",
       "      <td>705.22</td>\n",
       "      <td>781.01</td>\n",
       "      <td>831.76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43300</th>\n",
       "      <td>798.75</td>\n",
       "      <td>817.58</td>\n",
       "      <td>731.98</td>\n",
       "      <td>648.42</td>\n",
       "      <td>751.30</td>\n",
       "      <td>648.67</td>\n",
       "      <td>662.05</td>\n",
       "      <td>773.15</td>\n",
       "      <td>835.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43301</th>\n",
       "      <td>527.66</td>\n",
       "      <td>443.82</td>\n",
       "      <td>545.88</td>\n",
       "      <td>624.18</td>\n",
       "      <td>420.25</td>\n",
       "      <td>676.80</td>\n",
       "      <td>583.41</td>\n",
       "      <td>395.46</td>\n",
       "      <td>509.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43302</th>\n",
       "      <td>512.56</td>\n",
       "      <td>415.41</td>\n",
       "      <td>517.36</td>\n",
       "      <td>532.37</td>\n",
       "      <td>592.30</td>\n",
       "      <td>382.20</td>\n",
       "      <td>538.35</td>\n",
       "      <td>448.02</td>\n",
       "      <td>496.39</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>43303 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       physics  biology  history  English  geography  literature  Portuguese  \\\n",
       "0       622.60   491.56   439.93   707.64     663.65      557.09      711.37   \n",
       "1       538.00   490.58   406.59   529.05     532.28      447.23      527.58   \n",
       "2       455.18   440.00   570.86   417.54     453.53      425.87      475.63   \n",
       "3       756.91   679.62   531.28   583.63     534.42      521.40      592.41   \n",
       "4       584.54   649.84   637.43   609.06     670.46      515.38      572.52   \n",
       "...        ...      ...      ...      ...        ...         ...         ...   \n",
       "43298   519.55   622.20   660.90   543.48     643.05      579.90      584.80   \n",
       "43299   816.39   851.95   732.39   621.63     810.68      666.79      705.22   \n",
       "43300   798.75   817.58   731.98   648.42     751.30      648.67      662.05   \n",
       "43301   527.66   443.82   545.88   624.18     420.25      676.80      583.41   \n",
       "43302   512.56   415.41   517.36   532.37     592.30      382.20      538.35   \n",
       "\n",
       "         math  chemistry  \n",
       "0      731.31     509.80  \n",
       "1      379.14     488.64  \n",
       "2      476.11     407.15  \n",
       "3      783.76     588.26  \n",
       "4      581.25     529.04  \n",
       "...       ...        ...  \n",
       "43298  581.25     573.92  \n",
       "43299  781.01     831.76  \n",
       "43300  773.15     835.25  \n",
       "43301  395.46     509.80  \n",
       "43302  448.02     496.39  \n",
       "\n",
       "[43303 rows x 9 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "0        1.33333\n",
       "1        2.98333\n",
       "2        1.97333\n",
       "3        2.53333\n",
       "4        1.58667\n",
       "          ...   \n",
       "43298    2.76333\n",
       "43299    3.81667\n",
       "43300    3.75000\n",
       "43301    2.50000\n",
       "43302    3.16667\n",
       "Name: gpa, Length: 43303, dtype: float64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "X = df.iloc[:, :-1] # All columns except the last as features. This creates a new DataFrame X.\n",
    "print(type(X))      # Confirm that X is a DataFrame by printing its type.\n",
    "y = df.iloc[:, -1]  # The last column contains the labels. This creates a new Series (like a 1-dimensional DataFrame) y\n",
    "display(X)          # Display the input columns\n",
    "display(y)          # Display the output (label) column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the variable `y` displays differently from `X`. This is because `y` is a Series (1-dimensional vector), while `X` is a DataFrame (2-dimensional matrix/table).\n",
    "\n",
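    "The difference can also be checked programmatically. The sketch below uses a small made-up table (the column names and values are hypothetical) with the same kind of `iloc` slicing:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical miniature table: two feature columns and one label column.\n",
    "toy = pd.DataFrame({'math': [700.0, 400.0], 'physics': [600.0, 500.0], 'gpa': [1.5, 3.0]})\n",
    "\n",
    "X_toy = toy.iloc[:, :-1]  # slicing columns keeps two dimensions -> DataFrame\n",
    "y_toy = toy.iloc[:, -1]   # a single integer column position -> Series\n",
    "\n",
    "print(type(X_toy).__name__, X_toy.ndim, X_toy.shape)\n",
    "print(type(y_toy).__name__, y_toy.ndim, y_toy.shape)\n",
    "print(y_toy.dtype)\n",
    "```\n",
    "\n",
    "Passing a list of positions instead, as in `toy.iloc[:, [-1]]`, would keep the result as a one-column DataFrame.\n",
    "\n",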
    "Also, in the output of the code cell above, `float64` means that each element in the `y` Series is a floating-point number represented with 64 bits."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}